central machine
Distributed Newton Can Communicate Less and Resist Byzantine Workers
We develop a distributed second-order optimization algorithm that is communication-efficient as well as robust against Byzantine failures of the worker machines. We propose an iterative approximate Newton-type algorithm in which the worker machines communicate \emph{only once} per iteration with the central machine. This is in sharp contrast with state-of-the-art distributed second-order algorithms such as GIANT \cite{giant} and DINGO \cite{dingo}, where the worker machines send (functions of) the local gradient and Hessian sequentially, and thus communicate twice with the central machine per iteration. Furthermore, we employ a simple norm-based thresholding rule to filter out the Byzantine worker machines. We establish a linear-quadratic rate of convergence for the proposed algorithm and show that the communication savings and Byzantine resilience come at the cost of only a small statistical error rate for arbitrary convex loss functions. To the best of our knowledge, this is the first work that addresses Byzantine resilience in second-order distributed optimization. Furthermore, we validate our theoretical results with extensive experiments on synthetically generated and benchmark LIBSVM \cite{libsvm} datasets and demonstrate convergence guarantees.
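The norm-based thresholding rule mentioned in this abstract can be illustrated with a minimal sketch. It assumes each worker sends one update vector per iteration and that the central machine discards a fixed fraction of the updates with the largest Euclidean norm before averaging. The function name `aggregate_by_norm`, the parameter `trim_fraction`, and the list-of-lists format are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def aggregate_by_norm(updates, trim_fraction):
    """Average the worker updates that survive norm-based trimming.

    updates: list of equal-length lists, one update vector per worker
             (hypothetical format chosen for this sketch).
    trim_fraction: assumed fraction of workers to discard, in [0, 1).
    """
    if not 0 <= trim_fraction < 1:
        raise ValueError("trim_fraction must lie in [0, 1)")
    n_keep = len(updates) - math.floor(trim_fraction * len(updates))
    # Sort by Euclidean norm and keep the n_keep smallest updates.
    by_norm = sorted(updates, key=lambda u: math.sqrt(sum(x * x for x in u)))
    kept = by_norm[:n_keep]
    dim = len(kept[0])
    # Coordinate-wise average of the surviving updates.
    return [sum(u[j] for u in kept) / n_keep for j in range(dim)]


# Four honest workers and one worker sending an update with a huge norm.
honest = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
byzantine = [[100.0, -100.0]]
direction = aggregate_by_norm(honest + byzantine, trim_fraction=0.2)
```

In this toy run the large-norm update is dropped and `direction` is the average of the four honest updates. A Byzantine worker that keeps its norm small would survive the filter, which is why such rules are analyzed in terms of the bounded error the surviving updates can introduce.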
High-Dimensional Differentially Private Quantile Regression: Distributed Estimation and Statistical Inference
Shen, Ziliang, Wang, Caixing, Wang, Shaoli, Yan, Yibo
With the development of big data and machine learning, privacy concerns have become increasingly critical, especially when handling heterogeneous datasets containing sensitive personal information. Differential privacy provides a rigorous framework for safeguarding individual privacy while enabling meaningful statistical analysis. In this paper, we propose a differentially private quantile regression method for high-dimensional data in a distributed setting. Quantile regression is a powerful and robust tool for modeling the relationships between the covariates and responses in the presence of outliers or heavy-tailed distributions. To address the computational challenges due to the non-smoothness of the quantile loss function, we introduce a Newton-type transformation that reformulates the quantile regression task into an ordinary least squares problem. Building on this, we develop a differentially private estimation algorithm with iterative updates, ensuring both near-optimal statistical accuracy and formal privacy guarantees. For inference, we further propose a differentially private debiased estimator, which enables valid confidence interval construction and hypothesis testing. Additionally, we propose a communication-efficient and differentially private bootstrap for simultaneous hypothesis testing in high-dimensional quantile regression, suitable for distributed settings with both small and abundant local data. Extensive simulations demonstrate the robustness and effectiveness of our methods in practical scenarios.
Finding Decision Tree Splits in Streaming and Massively Parallel Models
Pham, Huy, Ta, Hoang, Vu, Hoa T.
In this work, we provide data stream algorithms that compute optimal splits in decision tree learning. In particular, given a data stream of observations $x_i$ and their labels $y_i$, the goal is to find the optimal split point $j$ that divides the data into two sets such that the mean squared error (for regression) or misclassification rate (for classification) is minimized. We provide various fast streaming algorithms that use sublinear space and a small number of passes for these problems. These algorithms can also be extended to the massively parallel computation model. Our work, while not directly comparable, complements the seminal work of Domingos and Hulten (KDD 2000).
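To make the split-finding objective concrete, here is a minimal offline sketch for the regression case. It is not the streaming or massively parallel algorithm from the paper: it stores and sorts all observations, then uses running sums to evaluate the sum of squared errors of every candidate split in one pass. The function name `best_regression_split` and the convention that points with `x <= threshold` go to the left set are assumptions made for illustration.

```python
def best_regression_split(xs, ys):
    """Return (threshold, sse) minimizing the total squared error of a split.

    Points with x <= threshold form the left set, the rest form the right set.
    Each side is predicted by its own mean label.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)

    best_threshold, best_sse = None, float("inf")
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        left_sum += y
        left_sq += y * y
        if pairs[i + 1][0] == x:
            continue  # cannot split between two equal x values
        n_left, n_right = i + 1, n - (i + 1)
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared deviations from the mean: sum(y^2) - (sum y)^2 / count.
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if sse < best_sse:
            best_threshold, best_sse = x, sse
    return best_threshold, best_sse


threshold, sse = best_regression_split([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```

Minimizing the sum of squared errors is equivalent to minimizing the mean squared error, since the number of observations is fixed. The sketch uses linear space, whereas the streaming setting asks for the same answer, exactly or approximately, in sublinear space.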
DDAC-SpAM: A Distributed Algorithm for Fitting High-dimensional Sparse Additive Models with Feature Division and Decorrelation
He, Yifan, Wu, Ruiyang, Zhou, Yong, Feng, Yang
Distributed statistical learning has become a popular technique for large-scale data analysis. Most existing work in this area focuses on dividing the observations, but we propose a new algorithm, DDAC-SpAM, which divides the features under a high-dimensional sparse additive model. Our approach involves three steps: divide, decorrelate, and conquer. The decorrelation operation enables each local estimator to recover the sparsity pattern for each additive component without imposing strict constraints on the correlation structure among variables. The effectiveness and efficiency of the proposed algorithm are demonstrated through theoretical analysis and empirical results on both synthetic and real data. The theoretical results include both the consistent sparsity pattern recovery as well as statistical inference for each additive functional component. Our approach provides a practical solution for fitting sparse additive models, with promising applications in a wide range of domains.
Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation
Distributed minimax estimation and distributed adaptive estimation under communication constraints are studied for the Gaussian sequence model and the white noise model. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an optimally adaptive procedure for distributed estimation over a range of Besov classes. The results demonstrate significant differences between nonparametric function estimation in the distributed setting and the conventional centralized setting. For global estimation, adaptation in general cannot be achieved for free in the distributed setting. The new technical tools to obtain the exact characterization for the cost of adaptation can be of independent interest.
Distributed function estimation: adaptation using minimal communication
Szabo, Botond, van Zanten, Harry
Distributed methods have attracted a lot of attention in the statistics and machine learning communities recently. There are several reasons for this, the most prominent ones being that they provide a way of dealing with large datasets and with privacy considerations. The theoretical literature on distributed methods is still rather minimal at the moment. A number of papers have recently investigated fundamental performance limits in distributed models, in particular pointing out issues that occur in high-dimensional or nonparametric problems, see for instance [1, 2, 4, 8, 16, 17, 21, 24, 27]. For example, optimal rates in distributed function estimation depend on the amount of communication that is allowed, and the relation of that amount with the regularity of the unknown function. The lower bounds obtained in [25] and [28] and the subsequent adaptation results in [25] show that in particular, automatically adapting to the smoothness of the unknown function is a complicated issue in communication restricted distributed settings. In the present paper we study this problem from a different, we think relevant and interesting perspective, not restricting communication a priori, but asking for rate-optimal procedures that require minimal communication.
Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms
We study distributed estimation of a Gaussian mean under communication constraints in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between the communication costs and statistical accuracy, are established in both the univariate and multivariate settings. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under the communication constraints, both in terms of the optimal procedure design and lower bound argument. The techniques developed in this paper can be of independent interest. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design.
Communication-Efficient and Byzantine-Robust Distributed Learning
Ghosh, Avishek, Maity, Raj Kumar, Kadhe, Swanand, Mazumdar, Arya, Ramchandran, Kannan
We develop a communication-efficient distributed learning algorithm that is robust against Byzantine worker machines. We propose and analyze a distributed gradient-descent algorithm that performs a simple thresholding based on gradient norms to mitigate Byzantine failures. We show that the (statistical) error rate of our algorithm matches that of [YCKB18], which uses more complicated schemes (like coordinate-wise median or trimmed mean), and is thus optimal. Furthermore, for communication efficiency, we consider a generic class of $\delta$-approximate compressors from [KRSJ19] that encompasses sign-based compressors and top-k sparsification. Our algorithm uses compressed gradients and gradient norms for aggregation and Byzantine removal, respectively. We establish the statistical error rate of the algorithm for arbitrary (convex or non-convex) smooth loss functions. We show that, in the regime where the compression factor $\delta$ is constant and the dimension of the parameter space is fixed, the rate of convergence is not affected by the compression operation, and hence we effectively get the compression for free. Moreover, we extend the compressed gradient descent algorithm with error feedback proposed in [KRSJ19] to the distributed setting. We have experimentally validated our results and shown good convergence performance for convex (least-squares regression) and non-convex (neural network training) problems.
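As an illustration of the compressors mentioned in this abstract, the following sketch implements top-k sparsification and checks it against the contraction property usually taken as the definition of a $\delta$-approximate compressor, namely $\|C(x) - x\|^2 \le (1 - \delta)\|x\|^2$. That definition, the value $\delta = k/d$ for top-k, and the helper names `top_k` and `sq_norm` are assumptions made for this example; the abstract itself does not spell them out.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude coordinates of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def sq_norm(vec):
    """Squared Euclidean norm."""
    return sum(x * x for x in vec)


gradient = [0.5, -3.0, 0.1, 2.0]
k = 2
compressed = top_k(gradient, k)

# Squared compression error versus the assumed bound with delta = k / d.
residual = sq_norm([g - c for g, c in zip(gradient, compressed)])
delta = k / len(gradient)
bound = (1 - delta) * sq_norm(gradient)
satisfies_bound = residual <= bound
```

Here only the two largest-magnitude coordinates are transmitted, and the compression error stays well inside the assumed bound because the dropped coordinates carry little of the gradient's energy.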
Structure Learning of Sparse GGMs over Multiple Access Networks
Tavassolipour, Mostafa, Karamzade, Armin, Mirzaeifard, Reza, Motahari, Seyed Abolfazl, Shalmani, Mohammad-Taghi Manzuri
A central machine is interested in estimating the underlying structure of a sparse Gaussian Graphical Model (GGM) from datasets distributed across multiple local machines. The local machines can communicate with the central machine through a wireless multiple access channel. In this paper, we are interested in designing effective strategies under which reliable learning is feasible despite power and bandwidth limitations. Two approaches are proposed: the Signs and Uncoded methods. In the Signs method, the local machines quantize their data into binary vectors, and an optimal channel coding scheme is used to reliably send the vectors to the central machine, where the structure is learned from the received data. In the Uncoded method, data symbols are scaled and transmitted through the channel. The central machine uses the received noisy symbols to recover the structure. Theoretical results show that both methods can recover the structure with high probability for a large enough sample size. Experimental results indicate the superiority of the Signs method over the Uncoded method under several circumstances.
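One way to see why sign-quantized data can still carry structural information is the classical arcsine law for zero-mean jointly Gaussian variables, $\mathbb{E}[\operatorname{sign}(X)\operatorname{sign}(Y)] = \frac{2}{\pi}\arcsin(\rho)$. The sketch below inverts that identity to estimate a correlation from signs alone. It is an assumption-laden illustration, not the estimator from the paper: it ignores the channel entirely, handles a single pair of variables, and uses hypothetical names (`sign`, `correlation_from_signs`) and a simulated dataset.

```python
import math
import random


def sign(value):
    """One-bit quantizer: +1 for non-negative values, -1 otherwise."""
    return 1 if value >= 0 else -1


def correlation_from_signs(signs_x, signs_y):
    """Estimate the correlation of two zero-mean Gaussians from their signs."""
    agreement = sum(a * b for a, b in zip(signs_x, signs_y)) / len(signs_x)
    # Invert the arcsine law: rho = sin(pi / 2 * E[sign(X) * sign(Y)]).
    return math.sin(math.pi / 2 * agreement)


# Simulate correlated standard Gaussians (assumed toy data, fixed seed).
rng = random.Random(0)
true_rho = 0.6
signs_x, signs_y = [], []
for _ in range(20000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = z1
    y = true_rho * z1 + math.sqrt(1 - true_rho ** 2) * z2
    signs_x.append(sign(x))
    signs_y.append(sign(y))

estimate = correlation_from_signs(signs_x, signs_y)
```

With enough samples the estimate concentrates around the true correlation even though each observation was reduced to a single bit per coordinate.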